This paper presents a voice conversion (VC) method that utilizes conditional restricted Boltzmann machines (CRBMs) for each speaker to obtain high-order, speaker-independent spaces where voice features can be converted more easily than in the original acoustic feature space. The CRBM is expected to automatically discover common features lurking in time-series data. When we train two CRBMs independently, one for the source speaker and one for the target, using only speaker-dependent training data, each CRBM can be regarded as constructing a subspace that captures relatively less phonetic information and more speaker individuality than the original acoustic space, because the training data include various phonemes while the speaker individuality remains unchanged. The high-order features obtained from the two CRBMs are then connected by a neural network (NN) that maps from the source space to the target space. The entire network (the two CRBMs and the connecting NN) can also be fine-tuned as a recurrent neural network (RNN) using parallel acoustic data, since both the CRBMs and the connecting NN have network-based representations with time dependencies. Through voice-conversion experiments, we confirmed the high performance of our method, especially in terms of objective evaluation, comparing it with conventional GMM, NN, and RNN approaches and with our previous work on speaker-dependent DBNs.
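The conversion pathway described above (source CRBM encodes a frame into a high-order space, an NN maps the source hidden features to the target hidden space, and the target CRBM decodes back to acoustic features, with each CRBM conditioned on the previous frame) can be sketched as follows. This is a minimal illustrative sketch: all dimensions, weight initializations, and the frame-by-frame conversion loop are assumptions for demonstration, not the trained parameters or exact architecture of the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CRBM:
    """Minimal conditional RBM sketch: hidden units depend on the current
    visible frame and on the previous visible frame (the condition)."""
    def __init__(self, n_vis, n_hid, seed=0):
        r = np.random.default_rng(seed)
        self.W = 0.01 * r.standard_normal((n_vis, n_hid))  # visible-to-hidden
        self.A = 0.01 * r.standard_normal((n_vis, n_hid))  # condition-to-hidden
        self.B = 0.01 * r.standard_normal((n_vis, n_vis))  # autoregressive (condition-to-visible)
        self.b = np.zeros(n_hid)   # hidden bias
        self.c = np.zeros(n_vis)   # visible bias

    def encode(self, v_t, v_prev):
        # P(h = 1 | v_t, v_{t-1}): high-order feature for one frame
        return sigmoid(v_t @ self.W + v_prev @ self.A + self.b)

    def decode(self, h, v_prev):
        # Mean-field reconstruction of the visible frame given h and the condition
        return h @ self.W.T + v_prev @ self.B + self.c

def convert(src_crbm, tgt_crbm, M, seq):
    """Convert a source-speaker frame sequence to the target speaker.
    M is the (hypothetical) NN weight matrix linking the two hidden spaces."""
    n_vis = seq.shape[1]
    v_prev_src = np.zeros(n_vis)
    v_prev_tgt = np.zeros(n_vis)
    out = []
    for v_t in seq:
        h_src = src_crbm.encode(v_t, v_prev_src)   # source high-order feature
        h_tgt = sigmoid(h_src @ M)                 # NN map: source -> target hidden space
        v_tgt = tgt_crbm.decode(h_tgt, v_prev_tgt) # target acoustic frame
        out.append(v_tgt)
        v_prev_src, v_prev_tgt = v_t, v_tgt
    return np.array(out)

# Illustrative usage with random (untrained) weights and a toy 24-dim feature
src = CRBM(n_vis=24, n_hid=64, seed=1)
tgt = CRBM(n_vis=24, n_hid=64, seed=2)
M = 0.01 * np.random.default_rng(3).standard_normal((64, 64))
seq = np.random.default_rng(4).standard_normal((5, 24))  # 5 frames
converted = convert(src, tgt, M, seq)
```

In the paper's scheme this whole chain would additionally be fine-tuned end to end as an RNN on parallel data; the sketch only shows the untrained forward pass to make the data flow concrete.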